library(tidyverse)
library(gardenR)
library(lubridate)
library(ggthemes)
library(geofacet)
theme_set(theme_minimal())
# Lisa's garden data
data("garden_harvest")
# Seeds/plants (and other garden supply) costs
data("garden_spending")
# Planting dates and locations
data("garden_planting")
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to RStudio. To do that, you should read the instructions (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
keep_md: TRUE in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step to creating the html file. They are displayed fairly nicely in GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit changes (remember to include a commit message), and push them (green up arrow).
Put your name at the top of the document.
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
Use the garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week (HINT: use the wday() function from lubridate). Display the results so that the vegetables are rows but the days of the week are columns.
garden_harvest %>%
mutate(week_day = wday(date, label = TRUE)) %>%
group_by(vegetable, week_day) %>%
summarize(tot_weight_lb = sum(weight) * 0.00220462) %>%
pivot_wider(id_cols = vegetable,
names_from = week_day,
values_from = tot_weight_lb)
Use the garden_harvest data to find the total harvest in pounds for each vegetable variety and then try adding the plot from the garden_planting table. This will not turn out perfectly. What is the problem? How might you fix it?
garden_harvest %>%
group_by(vegetable, variety) %>%
summarize(tot_weight_lb = sum(weight) * 0.00220462) %>%
left_join(garden_planting %>% select(c("vegetable", "variety", "plot")),
by = c("vegetable", "variety"))
For some vegetable varieties, the garden_planting table records more than one plot, so left_join() returns one row per plot and the number of rows increases. This can be misleading. Take the Bush Bush Slender beans as an example: a reader might think that the total harvest of the variety was the same in plots M and D, and that Lisa harvested around 22.13 lb of beans in each plot. That is not the case. The underlying problem is that the garden harvest data contain no plot information.
There is no easy fix with the data as recorded. The most reliable solution would be to record the plot each harvest came from.
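The row duplication described above can be reproduced on a small example. The sketch below uses made-up toy tables (the weights and the plot letter for the peas are invented) and shows one partial workaround that the answer above does not propose: collapsing the planting table to one row per variety before joining. This keeps the row count stable but still cannot say how much was harvested from each plot.

```r
library(dplyr)

# Toy stand-ins for the harvest totals and the planting table (values are made up)
harvest_totals <- tibble(
  vegetable = c("beans", "peas"),
  variety = c("Bush Bush Slender", "Magnolia Blossom"),
  tot_weight_lb = c(22.1, 7.5)
)
planting <- tibble(
  vegetable = c("beans", "beans", "peas"),
  variety = c("Bush Bush Slender", "Bush Bush Slender", "Magnolia Blossom"),
  plot = c("M", "D", "B")
)

# A plain left_join() repeats the bean total once per plot
duplicated_rows <- harvest_totals %>%
  left_join(planting, by = c("vegetable", "variety"))

# Collapse to one row per variety, pasting all plots into one string
plots_per_variety <- planting %>%
  group_by(vegetable, variety) %>%
  summarize(plots = paste(sort(unique(plot)), collapse = ", "),
            .groups = "drop")

joined <- harvest_totals %>%
  left_join(plots_per_variety, by = c("vegetable", "variety"))
joined
```

Comparing nrow(duplicated_rows) with nrow(joined) shows the difference between the two joins.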
I would like to understand how much money I “saved” by gardening, for each vegetable type. Describe how I could use the garden_harvest and garden_spending datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.
> It’ll be helpful to use the left_join() function, joining the costs from garden_spending, together with the prices of vegetables at Whole Foods Market, to the garden_harvest dataset. Then we can mutate a new variable calculating the “revenue” from the vegetables grown by multiplying the price from the Whole Foods Market website by the weight of each vegetable variety. The amount of money “saved” can finally be calculated by subtracting the cost from the revenue.
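The approach described in words above might look roughly like the following. This is only a sketch on toy data: the store_prices table and its price_per_lb column are hypothetical, the prices and weights are invented, and the price_with_tax column name is assumed to match garden_spending.

```r
library(dplyr)

# Toy harvest data, weights in grams (made up)
harvest <- tibble(
  vegetable = c("tomatoes", "tomatoes", "beans"),
  weight = c(1000, 500, 400)
)
# Toy spending data; the column name price_with_tax is an assumption
spending <- tibble(
  vegetable = c("tomatoes", "beans"),
  price_with_tax = c(4.00, 3.00)
)
# Hypothetical store prices per pound (not real data)
store_prices <- tibble(
  vegetable = c("tomatoes", "beans"),
  price_per_lb = c(3.00, 2.50)
)

# Total cost per vegetable
cost_by_veg <- spending %>%
  group_by(vegetable) %>%
  summarize(cost = sum(price_with_tax), .groups = "drop")

# Total harvest in pounds, joined to prices and costs
savings <- harvest %>%
  group_by(vegetable) %>%
  summarize(tot_weight_lb = sum(weight) * 0.00220462, .groups = "drop") %>%
  left_join(store_prices, by = "vegetable") %>%
  left_join(cost_by_veg, by = "vegetable") %>%
  mutate(store_value = tot_weight_lb * price_per_lb,
         saved = store_value - cost)
savings
```

Using left_join() from the harvest side keeps every harvested vegetable, so a vegetable with no matching price or cost shows up with NA rather than silently disappearing.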
Subset the data to tomatoes. Reorder the tomato varieties from smallest to largest first harvest date. Create a barplot of total harvest in pounds for each variety, in the new order.
garden_harvest %>%
filter(vegetable == "tomatoes") %>%
group_by(variety) %>%
summarize(tot_weight_lb = sum(weight) * 0.00220462,
first_date = min(date)) %>%
ggplot(aes(x = tot_weight_lb, y = fct_reorder(variety, first_date))) +
geom_col() +
labs(x = "",
y = "",
title = "Total harvest (lb) for each tomato variety",
subtitle = "Ordered by the first harvest date") +
theme(plot.title = element_text(face = "bold"))
In the garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().
garden_harvest %>%
mutate(lower_case_name = str_to_lower(variety),
variety_name_length = str_length(variety)) %>%
distinct(vegetable, variety, .keep_all = TRUE) %>%
arrange(vegetable, variety_name_length)
In the garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().
garden_harvest %>%
filter(str_detect(variety, "er|ar")) %>%
distinct(variety)
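As a quick check of how the "er|ar" pattern behaves, here is a small sketch on a hand-written vector of variety names (chosen for illustration, not pulled from the data). Note that str_detect() is case-sensitive by default, so a name starting with "Ar" would only match with regex(ignore_case = TRUE).

```r
library(stringr)

# Illustrative variety names
varieties <- c("Amish Paste", "Cherokee Purple", "Big Beef", "Better Boy")

# TRUE where the name contains "er" or "ar"
has_er_or_ar <- str_detect(varieties, "er|ar")
varieties[has_er_or_ar]

# Case-insensitive version, in case capitalized matches should count too
str_detect("Armenian", regex("er|ar", ignore_case = TRUE))
```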
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
Two data tables are available:
Trips contains records of individual rentals
Stations gives the locations of the bike rental stations
Here is the code to read in the data. We do this a little differently than usual, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
sdate. Use geom_density().
Trips %>%
ggplot(aes(x = sdate)) +
geom_density(fill = "lightblue", color = NA, alpha = .6) +
labs(x = "",
y = "",
title = "Density estimate of bike-renting events' distribution by time")
Note: All the following interpretations may not perfectly match the corresponding plots (or tables), because I wrote them before accessing the full dataset.
There is not much information from this graph. In general, the density of events is higher in October than in other months, and decreases sharply starting from late October to late November. This indicates that the frequency of bike-renting events was much higher throughout most of the time in October 2014, and then, by November, people seemed less inclined to rent bikes.
The density rebounds in late November and keeps increasing until mid-December, which means that rentals became more and more frequent over that period, for unknown reasons. Yet the maximum density here is far lower than that in October, and the trend did not last long: rentals declined again from mid-December and reached their minimum at the end of December.
Use mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_of_day = round(hour + minute/60, 2)) %>%
ggplot(aes(x = time_of_day)) +
geom_density(fill = "lightblue", color = NA, alpha = 0.6) +
labs(x = "",
y = "",
title = "Density Estimate of the Events' Distribution by Time of Day")
The distribution is slightly left-skewed, centered around 12.5 (12:30pm), and clearly bimodal. Some rentals started late at night, but rentals peaked at around 8 (8am) and again at around 17.5 (17:30), with the evening peak higher than the morning peak.
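The decimal time-of-day variable used in the plot can be checked on a few hand-picked timestamps. This is a minimal sketch; the timestamps are invented for illustration.

```r
library(lubridate)

# Three made-up rental start times
starts <- ymd_hms(c("2014-10-01 03:30:00",
                    "2014-10-01 03:45:00",
                    "2014-10-01 17:30:00"))

# A minute is 1/60 of an hour, so 3:30 becomes 3.5, 3:45 becomes 3.75,
# and 17:30 becomes 17.5
time_of_day <- hour(starts) + minute(starts) / 60
time_of_day
```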
Trips %>%
mutate(week_day = wday(sdate, label = TRUE)) %>%
ggplot(aes(y = fct_rev(fct_infreq(week_day)))) +
geom_bar(fill = "cadetblue4") +
labs(x = "",
y = "",
title = "Frequency of bike-renting events by weekday")
According to the plot, the number of rentals did not vary much across days of the week. Still, in the last quarter of 2014, more rentals happened on Friday than on any other day, and the fewest happened on Saturday and Sunday.
Yet this doesn’t mean that people were inclined more to rent a bike on Friday than other weekdays, for each arbitrary week, during that period of time. It’s possible that the events of bike-renting happened, with an anomalously high frequency, on one or several Fridays.
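One way to check whether a few unusually busy days drive a weekday total is to count rentals per calendar date first and then summarize those daily counts by weekday. The sketch below does this on invented toy data (not the Trips table), where one busy Friday inflates the Friday total.

```r
library(dplyr)
library(lubridate)

# Toy trips: 5 on Fri Oct 3, 1 on Fri Oct 10, 2 each on Mon Oct 6 and Mon Oct 13
toy_trips <- tibble(
  sdate = ymd_hms(rep(c("2014-10-03 08:00:00", "2014-10-10 08:00:00",
                        "2014-10-06 08:00:00", "2014-10-13 08:00:00"),
                      times = c(5, 1, 2, 2)))
)

# Step 1: one row per calendar date with its number of trips
daily <- toy_trips %>%
  mutate(date = as_date(sdate),
         week_day = wday(date, label = TRUE)) %>%
  count(week_day, date, name = "trips_on_day")

# Step 2: compare the total with the typical and the busiest day per weekday
by_weekday <- daily %>%
  group_by(week_day) %>%
  summarize(total = sum(trips_on_day),
            median_per_day = median(trips_on_day),
            max_per_day = max(trips_on_day),
            .groups = "drop")
by_weekday
```

If the total for a weekday is high but its median daily count is not, the total is likely driven by a small number of unusual days.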
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_of_day = round(hour + minute/60, 2),
week_day = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day)) +
geom_density(fill = "lightblue", color = NA, alpha = 0.6) +
facet_wrap(vars(week_day), nrow = 1) +
labs(x = "",
y = "",
title = "Density Estimate of the Events' Distribution by Time of Day")
Yes, there seems to be a pattern. The density estimates for weekdays (Monday to Friday) share a bimodal shape, with one peak at around 8am and the other at approximately 17:30. The density plots for Saturday and Sunday also share a shape: the density increases until 1am, reaches its lowest level at 5am, then rises sharply to its highest point between 12:30pm and 15:00, and decreases monotonically until 24:00. The weekday pattern might be attributed to commuting, while the weekend pattern is harder to explain.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Casual). The next set of exercises investigates whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises.
Set the fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_of_day = round(hour + minute/60, 2),
week_day = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day, fill = client)) +
geom_density(alpha = .5, color = NA) +
facet_wrap(vars(week_day), nrow = 1) +
labs(x = "",
y = "",
title = "Density Estimate of the Events' Distribution by Time of Day")
From this plot, we can see that casual and registered users showed similar rental patterns on the weekend but clearly different patterns on weekdays. In general, registered users followed the bimodal pattern on weekdays, while casual renters rented most often at around 15:00, except on Mondays, when they tended to rent earlier.
Add position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_of_day = round(hour + minute/60, 2),
week_day = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day, fill = client)) +
geom_density(alpha = .5,
color = NA,
position = position_stack()) +
facet_wrap(vars(week_day), nrow = 1) +
labs(x = "",
y = "",
title = "Density Estimate of the Events' Distribution by Time of Day")
For comparing rental patterns between casual and registered users, the stacked position tells the story less well. With stacking, the density for casual users no longer starts from a flat baseline, so readers may mistake the overall density, outlined by the reddish area, for the density of rentals by casual users. For example, a reader could easily conclude that on Monday both types of users followed a bimodal pattern. To some degree, stacking obscures both the behavior of casual users and the key difference between the two client types.
On the other hand, stacking makes it easier to get a general sense of proportions: we can see roughly how much each type of user contributes to the total rental frequency.
position = position_stack()). Add a new variable to the dataset called weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_of_day = round(hour + minute/60, 2),
week_day = wday(sdate),
weekend = ifelse(near(week_day, 7) | near(week_day, 1), "weekend", "weekday")) %>%
ggplot(aes(x = time_of_day, fill = client)) +
geom_density(alpha = .5,
color = NA) +
facet_wrap(vars(weekend), nrow = 1) +
labs(x = "",
y = "",
title = "Density Estimate of the Events' Distribution by Time of Day")
This graph corroborates the earlier interpretation: casual and registered clients showed similar rental behavior on weekends and distinct behavior on weekdays.
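The weekend flag relies on wday() numbering, which by default (assuming the lubridate.week.start option has not been changed) treats Sunday as 1 and Saturday as 7. A small sketch on hand-picked dates, using %in% as an alternative to the two near() calls:

```r
library(lubridate)

# Fri, Sat, Sun, Mon in October 2014
dates <- ymd(c("2014-10-03", "2014-10-04", "2014-10-05", "2014-10-06"))

# With the default week start, Sunday is 1 and Saturday is 7
day_number <- wday(dates)

# Flag Saturday (7) and Sunday (1) as weekend
weekend <- ifelse(day_number %in% c(1, 7), "weekend", "weekday")
data.frame(dates, day_number, weekend)
```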
client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_of_day = round(hour + minute/60, 2),
week_day = wday(sdate),
weekend = ifelse(near(week_day, 7) | near(week_day, 1), "weekend", "weekday")) %>%
ggplot(aes(x = time_of_day, fill = weekend)) +
geom_density(alpha = .5,
color = NA) +
facet_wrap(vars(client), nrow = 1) +
labs(x = "",
y = "",
title = "Density Estimate of the Events' Distribution by Time of Day") +
theme(legend.position = "bottom", legend.title = element_blank())
While the previous graph emphasizes the comparison between the two client types on weekdays and on the weekend, this graph highlights the differences within each client type. The story here is that casual users behaved similarly on weekdays and weekends, whereas registered users behaved quite differently on weekdays than on the weekend.
If we want to focus on the difference in rental behavior between the two client types, then the previous graph is the better choice.
Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!
# summarize() gives one row per station; mutate() would keep one row per trip
Num_of_departure <- Trips %>%
group_by(sstation) %>%
summarize(departure_count = n())
Num_of_departure %>%
left_join(Stations %>% select(name, lat, long),
by = c("sstation" = "name")) %>%
ggplot(aes(x = long, y = lat, color = departure_count)) +
geom_point() +
labs(x = "",
y = "",
title = "Number of departures by geographic location")
Trips %>%
select(sstation, client) %>%
left_join(Stations %>% select(name, lat, long),
by = c("sstation" = "name")) %>%
ggplot(aes(x = long, y = lat, color = client)) +
geom_point(alpha = 0.3, size = 1.2) +
labs(x = "",
y = "",
title = "Departures by two types of users",
subtitle = "By geographic location")
We may notice from the plot that casual users seemed to concentrate the most in areas with a latitude between 38.87 and 38.9, and a longitude between around -77.05 and -77.0.
as_date(sdate) converts sdate from date-time format to date format.
Station_date_top_ten <- Trips %>%
mutate(date = as_date(sdate)) %>%
group_by(sstation, date) %>%
summarize(num_of_dep = n()) %>%
arrange(desc(num_of_dep)) %>%
head(10) %>%
select(sstation, date)
Station_date_top_ten
Trips %>%
mutate(date = as_date(sdate)) %>%
semi_join(Station_date_top_ten,
by = c("sstation" = "sstation", "date" = "date")) %>%
select(!date)
Trips %>%
mutate(date = as_date(sdate), week_day = wday(date, label = TRUE)) %>%
semi_join(Station_date_top_ten,
by = c("sstation" = "sstation", "date" = "date")) %>%
select(!date) %>%
group_by(client, week_day) %>%
summarize(num_of_trips = n()) %>%
mutate(prop = num_of_trips/sum(num_of_trips)) %>%
select(!num_of_trips) %>%
pivot_wider(names_from = client,
values_from = prop)
Among trips whose departures match the top ten station-date combinations, casual users rented most often on Saturday and least often on Wednesday. Registered users rented more on Wednesdays and Thursdays and least on the weekend.
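The filtering step used above, keeping only trips that match a set of station-date combinations, can be illustrated on a toy table. This is a sketch with invented stations and dates; it uses slice_max() to pick the single busiest combination rather than the top ten.

```r
library(dplyr)

# Toy trips table (stations, dates, and clients are made up)
trips <- tibble(
  sstation = c("A", "A", "B", "B", "C"),
  date = as.Date(c("2014-10-01", "2014-10-02", "2014-10-01",
                   "2014-10-01", "2014-10-01")),
  client = c("Registered", "Casual", "Casual", "Registered", "Registered")
)

# Busiest station-date combination
top_combos <- trips %>%
  count(sstation, date, name = "num_of_dep") %>%
  slice_max(num_of_dep, n = 1, with_ties = FALSE)

# semi_join() keeps matching rows of trips without adding columns
busiest_trips <- trips %>%
  semi_join(top_combos, by = c("sstation", "date"))
busiest_trips
```

Unlike left_join() or inner_join(), semi_join() never duplicates rows or brings over columns from the second table, which is why it suits this kind of filtering.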
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.
kids %>%
filter(variable == "lib", near(year, 1997) | near(year, 2016)) %>%
mutate(per_child_dollars = inf_adj_perchild * 1000) %>%
ggplot(aes(x = year, y = per_child_dollars)) +
geom_line(arrow = arrow(length = unit(0.25, "cm")), color = "white") +
facet_geo(~state) +
labs(x = "",
y = "",
title = "Change in public spending on libraries from 1997 to 2016",
subtitle = "Dollars spent per child, adjusted for inflation",
caption = "Source: Urban Institute") +
theme_void() +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
plot.background = element_rect(fill = "slategray3",
color = NA),
strip.text = element_text(color = "white", face = "bold"))